## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## [1] 1599
## [1] 1599 13
## alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## citric.acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## fixed.acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## free.sulfur.dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## quality : int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## residual.sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## total.sulfur.dioxide : num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## volatile.acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## X : int [1:1599] 1 2 3 4 5 6 7 8 9 10 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
The dataset contains 1599 wines and 13 columns: a row index (X) and 12 features (residual.sugar, density, quality, fixed.acidity, chlorides, pH, volatile.acidity, free.sulfur.dioxide, sulphates, citric.acid, total.sulfur.dioxide, alcohol). None of the variables is an ordered factor; all are numeric or integer.
Other observations:
- The median quality is 6, with scores ranging from a minimum of 3 to a maximum of 8 on a scale of 0-10.
- The number of samples per quality score is: 3 (10), 4 (53), 5 (681), 6 (638), 7 (199), 8 (18).
- The alcohol content of the red wines ranges from 8.4% to 14.9%, with 75% of the wines at or below 11.1%.
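The quality figures above can be cross-checked from the count table alone. The following is a minimal sketch in base R that rebuilds the quality vector from the counts printed by `table()` and recomputes the sample size, median, and mean. The variable names are illustrative, and treating quality as an ordered factor is an assumption of this sketch rather than a step taken in the analysis.

```r
# Quality counts copied from the table() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Expand the counts back into one quality score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines <- length(quality)
median_quality <- median(quality)
mean_quality <- mean(quality)

# Share of wines rated 5 or 6, the two dominant scores.
share_5_or_6 <- mean(quality %in% c(5, 6))

# Assumption: quality could be stored as an ordered factor for plotting
# or modelling; the analysis above keeps it as an integer.
quality_ordered <- factor(quality, levels = 3:8, ordered = TRUE)

print(c(n = n_wines, median = median_quality, mean = round(mean_quality, 3)))
print(round(share_5_or_6, 3))
```

The recomputed mean agrees with the 5.636 reported by `summary()`, and the share of wines rated 5 or 6 shows how concentrated the ratings are in the middle of the scale.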
The main feature of the data set is quality. I’d like to determine which features have the greatest impact on the quality of red wine.
Alcohol, fixed.acidity, volatile.acidity, citric.acid, chlorides, total.sulfur.dioxide, density, and sulphates are likely to contribute to the quality of red wine.
I did not create any new variables from the existing ones.
Fixed.acidity, volatile.acidity, density, pH, alcohol, and quality are close to normally distributed. Residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates are right-skewed, with most values at the low end and a long tail toward high values. Citric.acid is somewhat evenly distributed but has many values at 0 (132 total). Several of the skewed features (residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates) also appear to have outliers that could affect the analysis. I log-transformed the skewed distributions.
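To illustrate why a log transform helps with a long right tail, here is a minimal sketch in base R. It uses only the first ten residual.sugar values printed by `str()` above, so it is an illustration rather than a result for the full dataset. The `skewness` helper is hypothetical (base R has no built-in skewness function) and uses the simple moment-based definition.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a long tail toward high values.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# The mean sits above the median when the tail is on the right.
mean_above_median <- mean(residual_sugar) > median(residual_sugar)

# log10() is not finite at zero, which matters for a feature such as
# citric.acid that has 132 values of exactly 0.
log_of_zero <- log10(0)

print(c(raw = skew_raw, log10 = skew_log))
print(log_of_zero)
```

On these ten values the skewness stays positive after the transform but becomes smaller, which is the effect the log transform is meant to achieve. The last line also shows one reason a plain log transform would not suit citric.acid without first handling the zero entries.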
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: quality and log10(residual.sugar)
## t = 0.9407, df = 1597, p-value = 0.347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02551727 0.07247084
## sample estimates:
## cor
## 0.02353331
##
## Pearson's product-moment correlation
##
## data: quality and log10(chlorides)
## t = -7.1508, df = 1597, p-value = 1.308e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2232336 -0.1282260
## sample estimates:
## cor
## -0.17614
##
## Pearson's product-moment correlation
##
## data: quality and log10(free.sulfur.dioxide)
## t = -2.0041, df = 1597, p-value = 0.04522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.098865884 -0.001068979
## sample estimates:
## cor
## -0.05008749
##
## Pearson's product-moment correlation
##
## data: quality and log10(total.sulfur.dioxide)
## t = -6.8999, df = 1597, p-value = 7.476e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2173510 -0.1221403
## sample estimates:
## cor
## -0.1701427
##
## Pearson's product-moment correlation
##
## data: quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: quality and log10(sulphates)
## t = 12.9672, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2636092 0.3523323
## sample estimates:
## cor
## 0.3086419
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Quality is most strongly correlated with alcohol (r = 0.48), followed by sulphates (r = 0.31 after a log10 transform) and citric.acid (r = 0.23), and it is negatively correlated with volatile.acidity (r = -0.39). These relationships make sense given what each attribute measures. Volatile.acidity is the amount of acetic acid in the wine, and a higher value means more of an unpleasant, vinegar taste. Citric.acid can add freshness and flavor to wines. Sulphates can help keep wine fresh. Total.sulfur.dioxide has a weaker relationship with quality (r = -0.17 after a log10 transform), since low amounts are prevalent in both lower and higher quality wines whereas higher amounts appear in mid-quality wines. Sulfur dioxide (SO2) becomes evident in the nose and taste of wine at concentrations over 50 ppm, which may be why those wines are rated at the mid-level.
Quality also has weaker correlations with fixed.acidity (r = 0.12), chlorides (r = -0.18 after a log10 transform), and density (r = -0.17). Citric.acid has a strong positive relationship with fixed.acidity (r = 0.67) and a negative relationship with volatile.acidity (r = -0.55). Since all of these are acids, they affect the pH of the wine. Density depends on the alcohol and sugar content of the wine. Free.sulfur.dioxide also contributes to the total.sulfur.dioxide of the wine (r = 0.67).
The strongest relationship with quality is alcohol. Among the relationships between the other features, the strongest included fixed.acidity with pH (r = -0.68) and fixed.acidity with density (r = 0.67).
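The t statistic and confidence interval in the Pearson test output can be recomputed from the correlation coefficient and the sample size alone. The sketch below does this for quality and alcohol, using the standard t transformation of r and a Fisher z interval. The values of `r` and `n` are copied from the output above; the variable names are illustrative.

```r
# Correlation between quality and alcohol and the sample size,
# both copied from the output above.
r <- 0.4761663
n <- 1599

# t statistic for testing "true correlation is not equal to 0".
df <- n - 2
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval via the Fisher z transformation.
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 4))
print(round(ci, 4))
print(p_value < 2.2e-16)
```

With 1597 degrees of freedom even small correlations reach statistical significance (for example pH, with r = -0.06 and p = 0.02), so the size of r is more informative than the p-value when ranking features.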
Exploring how alcohol, sulphates, citric.acid, volatile.acidity, and total.sulfur.dioxide relate to quality, I was able to show a relationship between quality and alcohol, acidity (higher citric.acid and lower volatile.acidity), and sulphates. Alcohol clearly has the largest impact, but sulphates and citric.acid also show an interesting relationship, since the higher quality wines were plotted in the upper right of the graphs for those features. In the plot of volatile.acidity and alcohol, on the other hand, the negative correlation between volatile.acidity and quality placed the higher quality wines in the lower right.
The relationship between lower volatile.acidity and higher citric.acid is more prevalent in the diagram. This supports the idea that citric.acid adds flavor and freshness while volatile.acidity negatively impacts the flavor.
Plotting wine quality against alcohol content shows the relationship between the two clearly. As the alcohol content increases, the quality of the wine also tends to increase in a nearly linear fashion. The alcohol ranges of the different quality ratings overlap somewhat, which means that alcohol is important but that a combination of other factors also plays a role in the quality of a wine.
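One way to summarise this kind of plot numerically is to compute the median alcohol content within each quality rating. The sketch below is a minimal base R version. It runs on only the first ten rows printed by `str()` above, which is far too few to show the trend reliably, so the point is the technique rather than the numbers. `median_by_quality` is a hypothetical helper, not a function from the original analysis.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: median of one feature within each quality rating.
median_by_quality <- function(df, feature) {
  tapply(df[[feature]], df$quality, median)
}

alcohol_medians <- median_by_quality(wine_head, "alcohol")
print(alcohol_medians)
```

Applied to the full data frame, the same call would give one median per quality rating from 3 to 8, which can be compared directly with the centre lines of a boxplot.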
We determined that quality has a moderate positive correlation with alcohol (r = 0.48) and a moderate negative correlation with volatile.acidity (r = -0.39). This graph shows that relationship: a higher alcohol content combined with a lower volatile.acidity is associated with a higher quality wine (better wines in lower left of graph).
This boxplot shows the relationship between citric.acid and the quality of wine. Only one outlier appears in the dataset. Plotting the median along with the boxplot shows citric.acid increasing along with quality. As a result, we can say that higher citric.acid levels in a wine tend to go with a better quality rating.
I was able to investigate the different features of the data set and perform an analysis to determine which had the greatest impact on quality. The features that factored into quality the most were alcohol content, sulphates, and acidity (citric.acid and volatile.acidity). The correlations and graphs illustrated the relationships between these features and the trends that resulted from increasing or decreasing the amount of each in wine. Although there may be some variation, the highest quality wines were higher in alcohol content, sulphates, and citric.acid while having a lower volatile.acidity. This resulted in the freshest, best tasting wines that were desired most by the experts rating the wines.

The analysis could be enriched by performing a more in-depth comparison of the relationships between all of the different features. I only looked at the top five correlations, so a broader comparison could provide more information to support the conclusion. I also could have accounted for the entries of 0 for citric.acid (or other data quality issues). Overall, the analysis provided a great opportunity to explore the data set and reinforce the skills learned through the lessons.